Searching BWT Compressed Text with the Boyer-Moore Algorithm and Binary Search

Authors

  • Timothy C. Bell
  • Matt Powell
  • Amar Mukherjee
  • Donald A. Adjeroh
Abstract

This paper explores two techniques for on-line exact pattern matching in files that have been compressed using the Burrows-Wheeler transform. The first is an application of the Boyer-Moore algorithm (Boyer & Moore 1977) to a transformed string. The second is based on the observation that the transform effectively contains a sorted list of all substrings of the original text, which can be exploited for very rapid searching using a variant of binary search. Both methods are faster than a decompress-and-search approach for small numbers of queries, and binary search is much faster even for large numbers of queries.
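The binary-search idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not describe the auxiliary arrays the authors use, so the helper names here (`bwt_encode`, `build_index`, `prefix_of_row`, `search_rows`, `find_all`) and the `"\0"` sentinel are assumptions made for illustration. The sketch also assumes the raw BWT output is already available, so any entropy-coding stage of a real compressor would have to be undone first. The encoder sorts whole rotations and is only meant for small inputs.

```python
SENTINEL = "\0"  # assumption: never occurs in the text and sorts before every other character


def bwt_encode(text):
    """Naive BWT: last column of the sorted rotations of text + SENTINEL."""
    s = text + SENTINEL
    rows = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return "".join(s[i - 1] for i in rows)


def build_index(bwt):
    """Return (first, fl): the first column of the sorted rotations, and a map
    from each row to the row whose rotation starts one text position later."""
    fl = sorted(range(len(bwt)), key=lambda i: bwt[i])  # relies on sorted() being stable
    first = [bwt[i] for i in fl]
    return first, fl


def prefix_of_row(first, fl, row, m):
    """First m characters of the rotation in `row`, read without decoding the text."""
    chars = []
    for _ in range(m):
        chars.append(first[row])
        row = fl[row]
    return "".join(chars)


def search_rows(first, fl, pattern):
    """Binary search for the half-open range of rows whose rotation starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(first)
    while lo < hi:  # lower bound: first row whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix_of_row(first, fl, mid, m) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(first)
    while lo < hi:  # upper bound: first row whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix_of_row(first, fl, mid, m) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def find_all(bwt, pattern):
    """Sorted text offsets of every occurrence of pattern, using only the BWT string."""
    first, fl = build_index(bwt)
    start, end = search_rows(first, fl, pattern)
    # Row 0 starts at the sentinel, so fl[0] is the row that starts at offset 0.
    pos = [0] * len(fl)
    row = fl[0]
    for offset in range(len(fl)):
        pos[row] = offset
        row = fl[row]
    return sorted(pos[r] for r in range(start, end))
```

For example, `find_all(bwt_encode("banana"), "ana")` returns `[1, 3]`. Each comparison reads at most `len(pattern)` characters from the index, so once the index is built a query costs on the order of `len(pattern) * log(n)` character reads. That is consistent with the abstract's claim that binary search stays fast over many queries.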


Similar resources

A Comparison of BWT Approaches to Compressed-Domain Pattern Matching

A number of algorithms have recently been developed to search files compressed with the Burrows-Wheeler Transform (BWT) without the need for full decompression first. This allows the storage requirement of data to be reduced through the exceptionally good compression offered by BWT, while still allowing fast access to the information for searching. We provide a detailed description of five of t...


A comparison of BWT approaches to string pattern matching

Recently a number of algorithms have been developed to search files compressed with the Burrows-Wheeler Transform (BWT) without the need for full decompression first. This allows the storage requirement of data to be reduced through the exceptionally good compression offered by BWT, while allowing fast access to the information for searching by taking advantage of the sorted nature of BWT files....


LZgrep: a Boyer-Moore string matching tool for Ziv-Lempel compressed text

We present a Boyer-Moore approach to string matching over LZ78 and LZW compressed text. The idea is to search the text directly in compressed form instead of decompressing and then searching it. We modify the Boyer-Moore approach so as to skip text using the characters explicitly represented in the LZ78/LZW formats, modifying the basic technique where the algorithm can choose which characters t...


Boyer-Moore String Matching over Ziv -

We present a Boyer-Moore approach to string matching over LZ78 and LZW compressed text. The key idea is that, although we cannot choose exactly which text characters to inspect, we can still use the characters explicitly represented in those formats to shift the pattern in the text. We present a basic approach and more advanced ones. Although the theoretical average complexity does not ...


Longest-match String Searching for Ziv–Lempel Compression

Ziv–Lempel coding is currently one of the more practical data compression schemes. It operates by replacing a substring of a text with a pointer to its longest previous occurrence in the input, for each coding step. Decoding a compressed file is very fast, but encoding involves searching at each coding step to find the longest match for the next few characters. This paper presents eight...




Publication date: 2002